136 research outputs found
SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications
Summary: The Smith Waterman (SW) algorithm, which produces the optimal
pairwise alignment between two sequences, is frequently used as a key component
of fast heuristic read mapping and variation detection tools, but current
implementations are either designed as monolithic protein database searching
tools or are embedded into other tools. To facilitate easy integration of the
fast Single Instruction Multiple Data (SIMD) SW algorithm into third party
software, we wrote a C/C++ library, which extends Farrar's Striped SW (SSW) to
return alignment information in addition to the optimal SW score. Availability:
SSW is available both as a C/C++ software library and as a stand-alone
alignment tool wrapping the library's functionality at
https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library Contact:
[email protected] Comment: 3 pages, 2 figures
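As an illustration of what "alignment information in addition to the optimal SW score" can mean, here is a minimal sketch of plain Smith-Waterman local alignment with traceback. It is not the SSW library's API or implementation: SSW is a C/C++ library built on Farrar's striped SIMD formulation, whereas this sketch is scalar Python with a linear gap penalty. The function name `smith_waterman` and the scoring values are assumptions chosen for the example.

```python
def smith_waterman(query, target, match=2, mismatch=-2, gap=-3):
    """Return (best_score, aligned_query, aligned_target) for a local alignment.

    Scalar dynamic programming with a linear gap penalty (an assumption made
    for brevity; production tools typically use affine gaps and SIMD).
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; local alignment clamps every cell at zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # match or mismatch
                score[i - 1][j] + gap,      # gap in target
                score[i][j - 1] + gap,      # gap in query
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    aligned_q, aligned_t = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_q.append(query[i - 1])
            aligned_t.append(target[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_t.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_t.append(target[j - 1])
            j -= 1

    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_t))
```

Calling `smith_waterman("ACGT", "GGACGTGG")` returns the score together with the two aligned substrings, which is the kind of extra information a score-only implementation would not report.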
Graphical pangenomics
Completely sequencing genomes is expensive, and to save costs we often analyze new genomic data in the context of a reference genome. This approach distorts our image of the inferred genome, an effect which we describe as reference bias. To mitigate reference bias, I repurpose graphical models previously used in genome assembly and alignment to serve as a reference system in resequencing. To do so I formalize the concept of a variation graph to link genomes to a graphical model of their mutual alignment that is capable of representing any kind of genomic variation, both small and large. As this model combines both sequence and variation information in one structure it serves as a natural basis for resequencing. By indexing the topology, sequence space, and haplotype space of these graphs and developing generalizations of sequence alignment suitable to them, I am able to use them as reference systems in the analysis of a wide array of genomic systems, from large vertebrate genomes to microbial pangenomes. To demonstrate the utility of this approach, I use my implementation to solve resequencing and alignment problems in the context of Homo sapiens and Saccharomyces cerevisiae. I use graph visualization techniques to explore variation graphs built from a variety of sources, including diverged human haplotypes, a gut microbiome, and a freshwater viral metagenome. I find that variation-aware read alignment can eliminate reference bias at known variants, and this is of particular importance in the analysis of ancient DNA, where existing approaches result in significant bias towards the reference genome and concomitant distortion of population genetics results. I validate that the variation graph model can be applied to align RNA sequencing data to a splicing graph. Finally, I show that a classical pangenomic inference problem in microbiology can be solved using a resequencing approach based on variation graphs. Wellcome Trust PhD fellowship
Haplotype-aware graph indexes
The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes
Haplotype-aware graph indexes.
MOTIVATION: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. RESULTS: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. AVAILABILITY AND IMPLEMENTATION: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online
A profile in FIRE: resolving the radial distributions of satellite galaxies in the Local Group with simulations
While many tensions between Local Group (LG) satellite galaxies and LCDM
cosmology have been alleviated through recent cosmological simulations, the
spatial distribution of satellites remains an important test of physical models
and physical versus numerical disruption in simulations. Using the FIRE-2
cosmological zoom-in baryonic simulations, we examine the radial distributions
of satellites with Mstar > 10^5 Msun around 8 isolated Milky Way- (MW) mass
host galaxies and 4 hosts in LG-like pairs. We demonstrate that these
simulations resolve the survival and physical destruction of satellites with
Mstar >~ 10^5 Msun. The simulations broadly agree with LG observations,
spanning the radial profiles around the MW and M31. This agreement does not
depend strongly on satellite mass, even at distances <~ 100 kpc. Host-to-host
variation dominates the scatter in satellite counts within 300 kpc of the
hosts, while time variation dominates scatter within 50 kpc. More massive host
galaxies within our sample have fewer satellites at small distances, likely
because of enhanced tidal destruction of satellites via the baryonic disks of
host galaxies. Furthermore, we quantify and provide fits to the tidal depletion
of subhalos in baryonic relative to dark matter-only simulations as a function
of distance. Our simulated profiles imply observational incompleteness in the
LG even at Mstar >~ 10^5 Msun: we predict 2-10 such satellites to be discovered
around the MW and possibly 6-9 around M31. To provide cosmological context, we
compare our results with the radial profiles of satellites around MW analogs in
the SAGA survey, finding that our simulations are broadly consistent with most
SAGA systems.
Comment: 18 pages, 10 figures, plus appendices. Main results in figures 2, 3,
and 4. Accepted version
Genomic diversity and novel genome-wide association with fruit morphology in Capsicum, from 746k polymorphic sites
Capsicum is one of the major vegetable crops grown worldwide. Current subdivision in clades and species is based on morphological traits and coarse sets of genetic markers. Broad variability of fruits has been driven by breeding programs and has been mainly studied by linkage analysis. We discovered 746k variable sites by sequencing 1.8% of the genome in a collection of 373 accessions belonging to 11 Capsicum species from 51 countries. We describe genomic variation at the population level, confirm major subdivision in clades and species, and show that the known major subdivision of C. annuum separates large and bulky fruits from small ones. In C. annuum, we identify four novel loci associated with phenotypes determining the fruit shape, including a non-synonymous mutation in the gene Longifolia 1-like (CA03g16080). Our collection covers all the economically important species of Capsicum widely used in breeding programs and represents the widest and largest study so far in terms of the number of species and number of genetic variants analyzed. We identified a large set of markers that can be used for population genetic studies and genetic association analyses. Our results provide a comprehensive and precise perspective on genomic variability in Capsicum at the population level and suggest that future fine genetic association studies will yield useful results for breeding
The distribution and mutagenesis of short coding INDELs from 1,128 whole exomes
BACKGROUND: Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls. RESULTS: This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%. CONCLUSIONS: In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-015-1333-7) contains supplementary material, which is available to authorized users
Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph
Abstract: Background: During the last decade, the analysis of ancient DNA (aDNA) sequence has become a powerful tool for the study of past human populations. However, the degraded nature of aDNA means that aDNA molecules are short and frequently mutated by post-mortem chemical modifications. These features decrease read mapping accuracy and increase reference bias, in which reads containing non-reference alleles are less likely to be mapped than those containing reference alleles. Alternative approaches have been developed to replace the linear reference with a variation graph which includes known alternative variants at each genetic locus. Here, we evaluate the use of variation graph software vg to avoid reference bias for aDNA and compare with existing methods. Results: We use vg to align simulated and real aDNA samples to a variation graph containing 1000 Genomes Project variants and compare with the same data aligned with bwa to the human linear reference genome. Using vg leads to a balanced allelic representation at polymorphic sites, effectively removing reference bias, and more sensitive variant detection in comparison with bwa, especially for insertions and deletions (indels). Alternative approaches that use relaxed bwa parameter settings or filter bwa alignments can also reduce bias but can have lower sensitivity than vg, particularly for indels. Conclusions: Our findings demonstrate that aligning aDNA sequences to variation graphs effectively mitigates the impact of reference bias when analyzing aDNA, while retaining mapping sensitivity and allowing detection of variation, in particular indel variation, that was previously missed
Viral coinfection analysis using a MinHash toolkit
Abstract: Background: Human papillomavirus (HPV) is a common sexually transmitted infection associated with cervical cancer that frequently occurs as a coinfection of types and subtypes. Highly similar sublineages that show over 100-fold differences in cancer risk are not distinguishable in coinfections with current typing methods. Results: We describe an efficient set of computational tools, rkmh, for analyzing complex mixed infections of related viruses based on sequence data. rkmh makes extensive use of MinHash similarity measures, and includes utilities for removing host DNA and classifying reads by type, lineage, and sublineage. We show that rkmh is capable of assigning reads to their HPV type as well as HPV16 lineage and sublineages. Conclusions: Accurate read classification enables estimates of percent composition when there are multiple infecting lineages or sublineages. While we demonstrate rkmh for HPV with multiple sequencing technologies, it is also applicable to other mixtures of related sequences
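To make the idea of MinHash-based read classification concrete, the following is a minimal sketch of a bottom-s MinHash sketch, a Jaccard estimate between two sketches, and a classifier that picks the most similar reference. It is not rkmh's code or interface: the function names, the k-mer length, the sketch size, the choice of BLAKE2b as the hash, and the toy reference sequences are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=4, size=16):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes of seq."""
    return sorted({hash_kmer(x) for x in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def classify(read, references, k=4, size=16):
    """Return the name of the reference whose sketch is most similar to the read."""
    read_sketch = sketch(read, k, size)
    return max(
        references,
        key=lambda name: jaccard_estimate(
            read_sketch, sketch(references[name], k, size), size
        ),
    )


# Toy, made-up references; real typing would use actual viral genomes.
toy_references = {"typeA": "ACGTACGTACGGA", "typeB": "TTTTGGGGCCCCTTTT"}
best_match = classify("ACGTACGTAC", toy_references)
```

With real data the k-mer length and sketch size would be much larger than in this toy example, and closely related sublineages would likely need more care than a single nearest-sketch decision.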